Lakshman Kumar S

Part A - 30 Marks

DOMAIN: Telecom

CONTEXT: A telecom company wants to use their historical customer data to predict behaviour to retain customers. You can analyse all relevant customer data and develop focused customer retention programs.

DATA DESCRIPTION: Each row represents a customer; each column contains a customer attribute described in the column metadata. The data set includes information about:

• Customers who left within the last month – the column is called Churn

• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

• Demographic info about customers – gender, age range, and if they have partners and dependents

PROJECT OBJECTIVE: To build a model that identifies customers with a high probability of churning. This helps the company understand the pain points and patterns behind customer churn and sharpen its focus on customer-retention strategy.

STEPS AND TASK [30 Marks]:

1. Data Understanding & Exploration: [5 Marks]

A. Read ‘TelcomCustomer-Churn_1.csv’ as a DataFrame and assign it to a variable. [1 Mark]

B. Read ‘TelcomCustomer-Churn_2.csv’ as a DataFrame and assign it to a variable. [1 Mark]

C. Merge both DataFrames on the key ‘customerID’ to form a single DataFrame. [2 Marks]

D. Verify that all the columns are incorporated in the merged DataFrame by using a simple comparison operator in Python. [1 Mark]
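Tasks 1A-1D can be sketched as follows. The filenames and the ‘customerID’ key come from the brief; the function and variable names are illustrative.

```python
import pandas as pd

def load_and_merge(path1: str, path2: str) -> pd.DataFrame:
    """Read both CSVs (1A, 1B) and merge them on 'customerID' (1C)."""
    df1 = pd.read_csv(path1)
    df2 = pd.read_csv(path2)
    merged = df1.merge(df2, on='customerID')
    # 1D: simple comparison - the merged frame should carry every column
    # from both inputs, with the shared key counted only once
    assert merged.shape[1] == df1.shape[1] + df2.shape[1] - 1
    return merged

# df = load_and_merge('TelcomCustomer-Churn_1.csv', 'TelcomCustomer-Churn_2.csv')
```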

2. Data Cleaning & Analysis: [5 Marks]

A. Impute missing/unexpected values in the DataFrame. [2 Marks]

B. Make sure all the variables with continuous values are of ‘Float’ type. [2 Marks]

[For Example: MonthlyCharges, TotalCharges]
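A minimal sketch of tasks 2A-2B. In this dataset TotalCharges is typically read in as a string and holds blanks for zero-tenure customers; the median imputation below is one reasonable choice, not the only one.

```python
import pandas as pd

def clean_charges(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce the continuous charge columns to float and impute gaps."""
    df = df.copy()
    for col in ['MonthlyCharges', 'TotalCharges']:
        df[col] = pd.to_numeric(df[col], errors='coerce')  # blanks -> NaN
        df[col] = df[col].fillna(df[col].median())          # 2A: impute
        df[col] = df[col].astype('float64')                 # 2B: float type
    return df
```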

C. Create a function that will accept a DataFrame as input and return pie-charts for all the appropriate Categorical features. Clearly show percentage distribution in the pie-chart. [4 Marks]
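One way to implement the pie-chart function of task 2C: plot every low-cardinality object column, with percentages shown via `autopct`. The `max_levels` cutoff is an illustrative assumption for deciding which columns count as categorical.

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_categorical_pies(df: pd.DataFrame, max_levels: int = 10):
    """One pie chart per low-cardinality object column, with percentage
    labels. Returns the list of columns plotted."""
    cats = [c for c in df.select_dtypes(include='object').columns
            if df[c].nunique() <= max_levels]
    for col in cats:
        counts = df[col].value_counts()
        fig, ax = plt.subplots()
        ax.pie(counts, labels=counts.index, autopct='%1.2f%%')
        ax.set_title(col)
        plt.close(fig)  # in a notebook, call plt.show() instead
    return cats
```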

D. Share insights for Q2.c. [2 Marks]

From the pie charts we can observe that:

  1. In total, there are 16 categorical variables.
  2. Six of these have only two categories, "Yes" or "No".
  3. "InternetService", "Contract" and "PaymentMethod" are candidates for one-hot encoding, as their values are independent of each other and cannot be ranked.
  4. 50.48% of customers are male and the remaining 49.52% are female.
  5. About 48.3% of customers have partners and 51.7% do not.
  6. 29.96% of customers have dependants, whereas 70.04% do not.
  7. A whopping 90.32% of customers have PhoneService and only 9.68% do not.
  8. Of the 90.32% of customers with PhoneService, 42.18% have multiple lines.
  9. 78.33% of customers have InternetService, of which 34.37% are on DSL and the remaining 43.96% on fibre optic.
  10. Only 28.67% of customers with InternetService have OnlineSecurity.
  11. 34.49% of customers with InternetService have OnlineBackup.
  12. 34.39% of customers with InternetService have DeviceProtection.
  13. Only 29.02% of customers with InternetService have TechSupport.
  14. Among customers with InternetService, 38.44% have StreamingTV and 38.79% have StreamingMovies.
  15. Contract has 3 categories: Month-to-month with 55.02% of customers, One year with 20.91% and Two year with 24.07%.
  16. About 59.22% of customers have opted for paperless billing.
  17. There are 4 payment methods: Credit card (automatic) with 21.61% of customers, Bank transfer (automatic) with 21.92%, Mailed check with 22.89% and Electronic check with 33.58%.
  18. Finally, only 26.54% of customers churned while the remaining 73.46% did not. This clearly shows that the dataset is imbalanced, and we need to either oversample or undersample to balance it.

These insights inform how we encode the categorical features and guide the modelling that follows.

E. Encode all the appropriate Categorical features with the best suitable approach. [2 Marks]
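Following the insight that six columns are binary Yes/No while "InternetService", "Contract" and "PaymentMethod" are nominal, task 2E can be sketched as below. It assumes identifier columns such as customerID have already been dropped; `encode_features` is an illustrative name.

```python
import pandas as pd

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    """Binary Yes/No columns -> 0/1; remaining nominal columns -> one-hot."""
    df = df.copy()
    binary_cols = [c for c in df.select_dtypes(include='object').columns
                   if set(df[c].dropna().unique()) <= {'Yes', 'No'}]
    for c in binary_cols:
        df[c] = df[c].map({'Yes': 1, 'No': 0})
    multi_cols = list(df.select_dtypes(include='object').columns)
    return pd.get_dummies(df, columns=multi_cols)
```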

F. Split the data into 80% train and 20% test. [1 Mark]

G. Normalize/Standardize the data with the best suitable approach. [2 Marks]
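Tasks 2F-2G together might look like the sketch below. Fitting the scaler on the training split only avoids leaking test-set statistics; standardization is shown as one reasonable choice.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, y, test_size=0.2, seed=42):
    """80/20 stratified split (2F), then standardize features (2G)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    scaler = StandardScaler()
    # Fit on train only, then apply the same transform to test
    return (scaler.fit_transform(X_train), scaler.transform(X_test),
            y_train, y_test)
```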

3. Model building and Improvement: [10 Marks]

A. Train a model using XGBoost. Also print best performing parameters along with train and test performance. [5 Marks]

B. Improve performance of the XGBoost as much as possible. Also print best performing parameters along with train and test performance. [5 Marks]

Conclusion

The classification goal is to predict the potential customers who have a higher probability to churn.

Most ML models work best when the classes are in roughly equal proportion, since they are designed to maximize accuracy and reduce error and therefore do not account for the class distribution. In our dataset, 26.5% of customers churned (Churn = 'Yes', i.e. 1) whereas about 73.5% did not (Churn = 'No', i.e. 0).

In such cases, performance measures such as precision, recall, and F1-score are more informative. We can also calculate these metrics specifically for the minority (positive) class.

The confusion matrix for class 1 (Churn) would look like:

                        Predicted: 0 (Not Churn)   Predicted: 1 (Churn)
Actual: 0 (Not Churn)   True Negatives             False Positives
Actual: 1 (Churn)       False Negatives            True Positives

In our case, recall holds more importance than precision, so we choose recall for class 1, along with accuracy, as the evaluation metrics. It is also important to watch how the model behaves on train and test scores across the cross-validation sets.

Modeling was sub-divided into two phases: in the first phase we applied XGBoost without hyperparameter tuning, and in the second phase XGBoost with hyperparameter tuning. Oversampling was then applied to the variant with higher accuracy and better recall.

Oversampling is one of the common ways to tackle imbalanced data: it covers various methods that increase the number of instances of the underrepresented class. Of these, we chose the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE’s main advantage over naive random over-sampling is that, by creating synthetic observations instead of reusing existing ones, the classifier is less likely to overfit.

In the first phase (XGB without Hyperparameter Tuning),

In the second phase (XGB with Hyperparameter Tuning),

Part B - 30 Marks

DOMAIN: IT

CONTEXT: The purpose is to build a machine learning workflow that runs autonomously irrespective of the dataset, so users can save the effort of building a new workflow for each dataset.

PROJECT OBJECTIVE: Build a machine learning workflow that will run autonomously with the csv file and return best performing model.

STEPS AND TASK [30 Marks]:

1. Build a simple ML workflow which will accept a single ‘.csv’ file as input and return a trained base model that can be used for predictions. You can use 1 Dataset from Part 1 (single/merged).

2. Create separate functions for various purposes.

3. Various base models should be trained, and the best performing model selected.

4. Pickle file should be saved for the best performing model.

Include best coding practices in the code:

• Modularization • Maintainability • Well commented code etc.
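The four steps above can be sketched as a single entry point. This is a deliberately minimal, hedged version: column handling is simple (label-encode the target, one-hot everything else), the three base models and the accuracy criterion are illustrative choices, and a real workflow would add imputation, scaling and cross-validation.

```python
import pickle
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def run_workflow(csv_path: str, target: str, model_path: str = 'best_model.pkl'):
    """Load a CSV, train several base models, pickle the best by test accuracy."""
    df = pd.read_csv(csv_path)                      # step 1: accept a .csv
    y = df[target].astype('category').cat.codes     # encode the target
    X = pd.get_dummies(df.drop(columns=[target]))   # encode the features
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    models = {'logreg': LogisticRegression(max_iter=1000),   # step 3: base models
              'tree': DecisionTreeClassifier(random_state=42),
              'forest': RandomForestClassifier(random_state=42)}
    scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
              for name, m in models.items()}
    best_name = max(scores, key=scores.get)
    with open(model_path, 'wb') as f:               # step 4: persist best model
        pickle.dump(models[best_name], f)
    return best_name, scores
```

To reuse the saved model on the next batch of data, load it back with `pickle.load` and call `predict` on features encoded the same way.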

Defining Main Function

Executing Main Function

To use the best performing model for next set of data

Model Evaluation

Conclusion

The classification goal is to predict the potential customers who have a higher probability to churn.

Most ML models work best when the classes are in roughly equal proportion, since they are designed to maximize accuracy and reduce error and therefore do not account for the class distribution. In our dataset, 26.5% of customers churned (Churn = 'Yes', i.e. 1) whereas about 73.5% did not (Churn = 'No', i.e. 0).

In such cases, performance measures such as precision, recall, and F1-score are more informative. We can also calculate these metrics specifically for the minority (positive) class.

The confusion matrix for class 1 (Churn) would look like:

                        Predicted: 0 (Not Churn)   Predicted: 1 (Churn)
Actual: 0 (Not Churn)   True Negatives             False Positives
Actual: 1 (Churn)       False Negatives            True Positives

In our case, recall holds more importance than precision, so we choose recall for class 1, along with accuracy, as the evaluation metrics. It is also important to watch how the model behaves on train and test scores across the cross-validation sets.

Modeling was sub-divided into two phases: in the first phase we applied standard models (with and without hyperparameter tuning, where applicable) such as Logistic Regression, k-Nearest Neighbours and Naive Bayes classifiers; in the second phase we applied ensemble techniques such as Decision Tree, Bagging, AdaBoost, Gradient Boosting and Random Forest classifiers. Oversampling was then applied to the models with higher accuracy and better recall for churn.

Oversampling is one of the common ways to tackle imbalanced data: it covers various methods that increase the number of instances of the underrepresented class. Of these, we chose the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE’s main advantage over naive random over-sampling is that, by creating synthetic observations instead of reusing existing ones, the classifier is less likely to overfit.

In the first phase (Standard machine learning models),

In the second phase (Ensemble models),